Abstract: We consider the compound decision problem of estimating a
vector of $n$ parameters, known up to a permutation, corresponding to $n$
independent observations, and discuss the difference between two symmetric
classes of estimators. The first and larger class is the set of all
permutation invariant estimators. The second class is restricted further to
simple symmetric procedures, that is, estimators such that each parameter
is estimated by a function of the corresponding observation alone. We show
that under mild conditions, the minimal total squared error risks over these
two classes are asymptotically equivalent up to essentially an $O(1)$ difference.
1. Introduction

Let $\mathcal{F} = \{F_\mu : \mu \in \mathcal{M}\}$ be a parametrized family of distributions. Let $Y_1, Y_2, \dots$
be a sequence of independent random variables, where $Y_i \in \mathcal{Y}$ and $Y_i \sim F_{\mu_i}$,
$i = 1, 2, \dots$. For each $n$, we suppose that the sequence $\mu_{1:n}$ is known up to
a permutation, where for any sequence $x = (x_1, x_2, \dots)$ we denote the
sub-sequence $x_s, \dots, x_t$ by $x_{s:t}$. We denote by $\boldsymbol{\mu} = \boldsymbol{\mu}_n$ the set $\{\mu_1, \dots, \mu_n\}$, i.e., $\boldsymbol{\mu}$
is $\mu_{1:n}$ without any order information. We consider in this note the problem
of estimating $\mu_{1:n}$ by $\hat\mu_{1:n}$ under the loss $\sum_{i=1}^n (\hat\mu_i - \mu_i)^2$, where $\hat\mu_{1:n} = \Delta(Y_{1:n})$.
We assume that the family $\mathcal{F}$ is dominated by a measure $\nu$, and denote the
corresponding densities simply by $f_i = f_{\mu_i}$, $i = 1, \dots, n$. The important example
is, of course, $F_{\mu_i} = N(\mu_i, 1)$.

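Let $\mathcal{D}^S = \mathcal{D}^S_n$ be the set of all simple symmetric decision functions $\Delta$, that
is, all $\Delta$ such that $\Delta(Y_{1:n}) = (\delta(Y_1), \dots, \delta(Y_n))$ for some function $\delta : \mathcal{Y} \to \mathcal{M}$.
In particular, the best simple symmetric function is denoted by
$\Delta^S_{\boldsymbol{\mu}} = (\delta^S_{\boldsymbol{\mu}}(Y_1), \dots, \delta^S_{\boldsymbol{\mu}}(Y_n))$:

\[ \Delta^S_{\boldsymbol{\mu}} = \arg\min_{\Delta \in \mathcal{D}^S} E\|\Delta - \mu_{1:n}\|^2, \]

and denote

\[ r^S_n = E\|\Delta^S_{\boldsymbol{\mu}}(Y_{1:n}) - \mu_{1:n}\|^2, \]

where, as usual, $\|a_{1:n}\|^2 = \sum_{i=1}^n a_i^2$.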

The class of simple rules may be considered too restrictive. Since the $\mu$s are
known up to a permutation, the problem seems to be one of matching the $Y$s to
the $\mu$s. Thus, if $Y_i \sim N(\mu_i, 1)$ and $n = 2$, a reasonable decision would make $\hat\mu_1$
closer to $\mu_1 \wedge \mu_2$ as $Y_2$ gets larger. The simple rule clearly remains inefficient
if the $\mu$s are well separated, and generally speaking, a bigger class of decision
rules may be needed to obtain efficiency. However, given the natural invariance
of the problem, it makes sense to be restricted to the class $\mathcal{D}^{PI} = \mathcal{D}^{PI}_n$ of all
permutation invariant decision functions, i.e., functions $\Delta$ that satisfy for any
permutation $\pi$ and any $(Y_1, \dots, Y_n)$:

\[ \Delta(Y_1, \dots, Y_n) = (\hat\mu_1, \dots, \hat\mu_n) \iff \Delta(Y_{\pi(1)}, \dots, Y_{\pi(n)}) = (\hat\mu_{\pi(1)}, \dots, \hat\mu_{\pi(n)}). \]

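Let

\[ \Delta^{PI}_{\boldsymbol{\mu}} = \arg\min_{\Delta \in \mathcal{D}^{PI}} E\|\Delta(Y_{1:n}) - \mu_{1:n}\|^2 \]

be the optimal permutation invariant rule under $\boldsymbol{\mu}$, and denote its risk by

\[ r^{PI}_n = E\|\Delta^{PI}_{\boldsymbol{\mu}}(Y_{1:n}) - \mu_{1:n}\|^2. \]

Obviously $\mathcal{D}^S \subset \mathcal{D}^{PI}$, and whence $r^S_n \ge r^{PI}_n$. Still, 'folklore', theorems in the
spirit of De Finetti, and results like Hannan and Robbins (1955), imply that
asymptotically (as $n \to \infty$) $\Delta^{PI}_{\boldsymbol{\mu}}$ and $\Delta^S_{\boldsymbol{\mu}}$ will have 'similar' risks. Our main
result establishes conditions that imply

\[ r^S_n - r^{PI}_n = O(1). \]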

To repeat, $\boldsymbol{\mu}$ is assumed known in this note. In the general decision theory
framework the unknown parameter is the order of its members to correspond
with $Y_{1:n}$, and the parameter space, therefore, corresponds to the set of all the
permutations of $1, \dots, n$.

An asymptotic equivalence as above implies that when we confine ourselves
to the class of permutation invariant procedures, we may further restrict
ourselves to the class of simple symmetric procedures, as is usually done in the
standard analysis of compound decision problems. The latter class is smaller and
simpler.

The motivation for this paper stems from the way the notion of oracle is used
in some sparse estimation problems. Consider two oracles, both of whom know the
value of $\boldsymbol{\mu}$. Oracle I is restricted to use only a procedure from the class $\mathcal{D}^{PI}$, while
Oracle II is restricted to use only procedures from $\mathcal{D}^S$. Obviously Oracle I has an
advantage; our results quantify this advantage and show that it is asymptotically
negligible. Furthermore, starting with Robbins (1951) various oracle inequalities
were obtained showing that one can achieve nearly the risk of Oracle II by a
'legitimate' statistical procedure. See, e.g., the survey Zhang (2003), for oracle
inequalities regarding the difference in risks. See also Brown and Greenshtein
(2007), and Jiang and Zhang (2007) for oracle inequalities regarding the ratio
of the risks. However, Oracle II is weak, and hence these claims may seem to
be too weak. Our equivalence results extend many of those oracle inequalities
to be valid also with respect to Oracle I. We needed a stronger result than the
usual objective that the mean risks are equal up to an $o(1)$ difference. Many of the
above-mentioned recent applications of the compound decision notion concern
sparse situations in which most of the $\mu$s are in fact 0, the mean risk is $o(1)$, and
the only interest is in the total risk.

Let $\mu_1, \dots, \mu_n$ be some arbitrary ordering of $\boldsymbol{\mu}$. Consider now the Bayesian
model under which $(\pi, Y_{1:n})$, $\pi$ a random permutation, have a distribution given
by

\[
\begin{aligned}
&\pi \text{ is uniformly distributed over } \mathcal{P}(1:n); \\
&\text{given } \pi,\ Y_{1:n} \text{ are independent},\ Y_i \sim F_{\mu_{\pi(i)}},\ i = 1, \dots, n,
\end{aligned}
\qquad (1.1)
\]

where for every $s < t$, $\mathcal{P}(s:t)$ is the set of all permutations of $s, \dots, t$. The
above description induces a joint distribution of $(M_1, \dots, M_n, Y_1, \dots, Y_n)$, where
$M_i = \mu_{\pi(i)}$, for a random permutation $\pi$.

The first part of the following proposition is a simple special case of general
theorems representing the best invariant procedure under certain groups as
Bayes with respect to the appropriate Haar measure; for background see, e.g.,
Berger (1985), Chapter 6. The second part of the proposition was derived in
various papers starting with Robbins (1951).

In the following proposition and proof, $E_{\mu_{1:n}}$ is the expectation under the
model in which the observations are independent and $Y_i \sim F_{\mu_i}$, while $E_{\boldsymbol{\mu}}$ is the
expectation under the above joint distribution of $Y_{1:n}$ and $M_{1:n}$. Note that under
the latter model, marginally $M_i \sim G_n$, the empirical measure defined by $\boldsymbol{\mu}$, and
conditional on $M_i = m$, $Y_i \sim F_m$, $i = 1, \dots, n$.

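Proposition 1.1. The best simple and permutation invariant rules are given
by

(i) $\Delta^{PI}_{\boldsymbol{\mu}}(Y_{1:n}) = E_{\boldsymbol{\mu}}(M_{1:n} \mid Y_{1:n})$.
(ii) $\delta^S_{\boldsymbol{\mu}}(Y_i) = E_{\boldsymbol{\mu}}(M_i \mid Y_i)$, $i = 1, \dots, n$.
(iii) Suppose $E_{\boldsymbol{\mu}}\|\Delta^S_{\boldsymbol{\mu}} - \Delta^{PI}_{\boldsymbol{\mu}}\|^2 = O(\epsilon_n^2)$; then $r^S_n = r^{PI}_n + O(\epsilon_n^2)$.

Proof. We need only to give the standard proof of the third part. First, note that
by invariance $\Delta^{PI}_{\boldsymbol{\mu}}$ is an equalizer (over all the permutations of $\boldsymbol{\mu}$), and hence
$E_{\mu_{1:n}}\|\Delta^{PI}_{\boldsymbol{\mu}} - \mu_{1:n}\|^2 = E_{\boldsymbol{\mu}}\|\Delta^{PI}_{\boldsymbol{\mu}} - M_{1:n}\|^2$. Also
$E_{\mu_{1:n}}\|\Delta^S_{\boldsymbol{\mu}} - \mu_{1:n}\|^2 = E_{\boldsymbol{\mu}}\|\Delta^S_{\boldsymbol{\mu}} - M_{1:n}\|^2$. Then, given the above joint distribution,

\[
\begin{aligned}
r^S_n &= E_{\boldsymbol{\mu}}\|\Delta^S_{\boldsymbol{\mu}} - M_{1:n}\|^2 \\
&= E_{\boldsymbol{\mu}} E_{\boldsymbol{\mu}}\{\|\Delta^S_{\boldsymbol{\mu}} - M_{1:n}\|^2 \mid Y_{1:n}\} \\
&= E_{\boldsymbol{\mu}} E_{\boldsymbol{\mu}}\{\|\Delta^S_{\boldsymbol{\mu}} - \Delta^{PI}_{\boldsymbol{\mu}}\|^2 + \|\Delta^{PI}_{\boldsymbol{\mu}} - M_{1:n}\|^2 \mid Y_{1:n}\} \\
&= r^{PI}_n + O(\epsilon_n^2).
\end{aligned}
\]

□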
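As a concrete illustration of parts (i) and (ii), the following minimal sketch computes both rules by brute force in the normal shift model $N(\mu_i, 1)$. The function names (`phi`, `simple_rule`, `pi_rule`) and the numerical values are illustrative assumptions, not taken from the paper, and the enumeration over all $n!$ permutations is feasible only for very small $n$.

```python
import math
from itertools import permutations

def phi(y, mu):
    """N(mu, 1) density at y (the normal shift model is assumed here)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def simple_rule(y, mus):
    """delta^S(y) = E(M_i | Y_i = y): posterior mean under the empirical prior G_n."""
    w = [phi(y, m) for m in mus]
    return sum(m * wi for m, wi in zip(mus, w)) / sum(w)

def pi_rule(ys, mus):
    """Delta^PI(y_{1:n}) = E(M_{1:n} | Y_{1:n}), summing over all n! permutations."""
    n = len(ys)
    num = [0.0] * n
    den = 0.0
    for perm in permutations(range(n)):
        # Likelihood of the data when observation i is matched to mean mus[perm[i]].
        like = math.prod(phi(ys[i], mus[perm[i]]) for i in range(n))
        den += like
        for i in range(n):
            num[i] += mus[perm[i]] * like
    return [x / den for x in num]

# Hypothetical example values.
mus = [0.0, 0.0, 3.0, 6.0]
ys = [0.4, -1.1, 2.5, 6.3]
est_s = [simple_rule(y, mus) for y in ys]
est_pi = pi_rule(ys, mus)
print("simple:", est_s)
print("PI    :", est_pi)
```

Two properties are easy to check on the output: the permutation invariant estimates always add up to $\sum_i \mu_i$, since every permutation uses each $\mu_j$ exactly once, while the simple estimates need not.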
We now briefly review some related literature and problems. On simple
symmetric functions, compound decision and its relation to empirical Bayes, see
Samuel (1965), Copas (1969), Robbins (1983), Zhang (2003), among many other papers.

Hannan and Robbins (1955) formulated essentially the same equivalence
problem in testing problems; see their Section 6. They show for a special case
an equivalence up to an $o(n)$ difference in the 'total risk' (i.e., non-averaged risk).
Our results for estimation under squared loss are stated in terms of the total
risk and we obtain an $O(1)$ difference.

Our results have a strong connection to De Finetti's Theorem. The
exchangeability induced on $M_1, \dots, M_n$ by the Haar measure implies 'asymptotic
independence' as in De Finetti's theorem, and consequently asymptotic
independence of $Y_1, \dots, Y_n$. Thus we expect $E(M_1 \mid Y_1)$ to be asymptotically similar to
$E(M_1 \mid Y_1, \dots, Y_n)$. Quantifying this similarity as $n$ grows has to do with the rate
of convergence in De Finetti's theorem. Such rates were established by Diaconis
and Freedman (1980), but are not directly applicable to obtain our results.

After quoting a simple result in the following section, we consider in Section
3 the special important, but simple, case of a two-valued parameter. In Section
4 we obtain a strong result under strong conditions. Finally, the main result is
given in Section 5; it covers the two preceding cases, but with some price to pay
for the generality.

2. Basic lemma and notation 

The following lemma is standard in comparison of experiments theory; for
background on comparison of experiments in testing see Lehmann (1986), p. 86. The
proof follows from a simple application of Jensen's inequality.

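Lemma 2.1. Consider two pairs of distributions, $\{G_0, G_1\}$ and $\{\tilde G_0, \tilde G_1\}$.
Suppose that there exists a Markov kernel $K$ such that $G_i(\cdot) = \int K(y, \cdot)\, d\tilde G_i(y)$,
$i = 0, 1$. Then

\[ E_{G_0}\, \psi\Big(\frac{dG_1}{dG_0}\Big) \le E_{\tilde G_0}\, \psi\Big(\frac{d\tilde G_1}{d\tilde G_0}\Big) \]

for any convex function $\psi$.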
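The content of the lemma can be checked numerically on finite sample spaces: pushing two distributions through the same Markov kernel cannot increase the expectation of a convex function of their likelihood ratio. The sketch below is only an illustration under assumed toy values; the distributions, the kernel and the helper names (`psi_divergence`, `push_forward`) are hypothetical.

```python
import math

def psi_divergence(g0, g1, psi):
    """E_{G0} psi(dG1/dG0) for discrete distributions with g0 > 0 everywhere."""
    return sum(p0 * psi(p1 / p0) for p0, p1 in zip(g0, g1))

def push_forward(g, kernel):
    """Law of the output when the input has law g and passes through a
    row-stochastic kernel (kernel[y][z] = probability of moving from y to z)."""
    m = len(kernel[0])
    return [sum(g[y] * kernel[y][z] for y in range(len(g))) for z in range(m)]

# Hypothetical "richer" experiment on three points.
g0_tilde = [0.5, 0.3, 0.2]
g1_tilde = [0.2, 0.3, 0.5]
# Hypothetical kernel from three points to two points.
kernel = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
g0 = push_forward(g0_tilde, kernel)
g1 = push_forward(g1_tilde, kernel)

convex_functions = {
    "chi-square": lambda t: (t - 1.0) ** 2,
    "total variation": lambda t: abs(t - 1.0),
    "Kullback-Leibler": lambda t: t * math.log(t),
}
for name, psi in convex_functions.items():
    after = psi_divergence(g0, g1, psi)
    before = psi_divergence(g0_tilde, g1_tilde, psi)
    print(f"{name}: after kernel {after:.4f} <= before kernel {before:.4f}")
```

In the later sections the kernel of interest is the one that maps a vector of observations to a uniformly random permutation of itself.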

For simplicity denote $f_i(\cdot) = f_{\mu_i}(\cdot)$, and for any random variable $X$, we may
write $X \sim g$ if $g$ is its density with respect to a certain dominating measure.
Finally, for simplicity we use the notation $y_{-i}$ to denote the sequence $y_1, \dots, y_n$
without its $i$th member, and similarly $\boldsymbol{\mu}_{-i} = \{\mu_1, \dots, \mu_n\} \setminus \{\mu_i\}$. Finally, $f_{-i}(Y_{-j})$ is
the marginal density of $Y_{-j}$ under the model (1.1) conditional on $M_j = \mu_i$.

3. Two valued parameter 

We suppose in this section that $\mu$ can get one of two values, which we denote by
$\{0, 1\}$. To simplify notation we denote the two densities by $f_0$ and $f_1$.



Theorem 3.1. Suppose that either of the following two conditions holds:

(i) $f_{1-\mu}(Y_1)/f_\mu(Y_1)$ has a finite variance under $\mu \in \{0, 1\}$.
(ii) $\sum_{i=1}^n \mu_i / n \to \gamma \in (0, 1)$, and $f_{1-\mu}(Y_1)/f_\mu(Y_1)$ has a finite variance under
$\mu = 0$ or $\mu = 1$.

Then $E\|\hat\mu^S - \hat\mu^{PI}\|^2 = O(1)$.

Proof. Suppose condition (i) holds. Let $K = \sum_{i=1}^n \mu_i$, and suppose, wlog, that
$K \le n/2$. Consider the Bayes model of (1.1). By Bayes Theorem

\[ P(M_1 = 1 \mid Y_1) = \frac{K f_1(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)}. \]

On the other hand

\[
\begin{aligned}
P(M_1 = 1 \mid Y_{1:n})
&= \frac{K f_1(Y_1) f_{K-1}(Y_{2:n})}{K f_1(Y_1) f_{K-1}(Y_{2:n}) + (n-K) f_0(Y_1) f_K(Y_{2:n})} \\
&= \frac{K f_1(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)}
\Big(1 + \frac{(n-K) f_0(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)} \Big(\frac{f_K}{f_{K-1}}(Y_{2:n}) - 1\Big)\Big)^{-1} \\
&= P(M_1 = 1 \mid Y_1)\Big(1 + O\Big(\frac{f_K}{f_{K-1}}(Y_{2:n}) - 1\Big)\Big),
\end{aligned}
\]

where, with some abuse of notation, $f_k(Y_{2:n})$ is the joint density of $Y_{2:n}$
conditional on $\sum_{j=2}^n M_j = k$.

We use Lemma 2.1 to compare the testing of $f_K(Y_{2:n})$ vs. $f_{K-1}(Y_{2:n})$ to
an easier problem, from which the original problem can be obtained by adding
a random permutation. Suppose for simplicity and wlog that in fact $Y_{2:K}$ are
i.i.d. under $f_1$, while $Y_{K+1:n}$ are i.i.d. under $f_0$. Then we compare

\[ g_{K-1}(Y_{2:n}) = \prod_{j=2}^{K} f_1(Y_j) \prod_{j=K+1}^{n} f_0(Y_j), \]

the true distribution, to the mixture

\[ g_K(Y_{2:n}) = g_{K-1}(Y_{2:n})\, \frac{1}{n-K} \sum_{j=K+1}^{n} \frac{f_1}{f_0}(Y_j). \]

However, the likelihood ratio between $g_K$ and $g_{K-1}$ is an average of $n-K$
independent terms, each with mean 1 (under $g_{K-1}$) and finite variance. It is,
therefore, $1 + O_p(n^{-1/2})$ in the mean square.

Consider now the second condition. By assumption, $K$ is of the same order
as $n$, and we can assume, wlog, that $f_1/f_0$ has a finite variance under $f_0$.
With this understanding, the above proof holds for the second condition. □

The condition of the theorem is clearly satisfied in the normal shift model:
$F_i = N(\mu_i, 1)$, $i = 1, 2$. It is satisfied for the normal scale model, $F_i = N(0, \sigma_i^2)$,
$i = 1, 2$, if $K$ is of the same order as $n$, or if $\sigma_0^2/2 < \sigma_1^2 < 2\sigma_0^2$.
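The two posterior probabilities compared in the proof of Theorem 3.1 can be evaluated exactly for small $n$. The sketch below assumes the normal shift model with $f_0 = N(0,1)$ and $f_1 = N(1,1)$ and hypothetical data; the helper names are illustrative. It computes $f_k(Y_{2:n})$ by averaging over all subsets of size $k$, and checks that the full posterior equals the simple posterior times the correction factor driven by $f_K/f_{K-1}$.

```python
import math
from itertools import combinations

def normal_pdf(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def f0(y):
    return normal_pdf(y, 0.0)  # density under mu = 0 (assumed model)

def f1(y):
    return normal_pdf(y, 1.0)  # density under mu = 1 (assumed model)

def joint_given_count(ys_rest, k):
    """f_k(y_{2:n}): density of y_{2:n} given that exactly k of M_2..M_n equal 1."""
    m = len(ys_rest)
    total = 0.0
    for ones in combinations(range(m), k):
        chosen = set(ones)
        total += math.prod(f1(y) if j in chosen else f0(y)
                           for j, y in enumerate(ys_rest))
    return total / math.comb(m, k)

def posterior_simple(y1, n, big_k):
    """P(M_1 = 1 | Y_1)."""
    return big_k * f1(y1) / (big_k * f1(y1) + (n - big_k) * f0(y1))

def posterior_full(ys, big_k):
    """P(M_1 = 1 | Y_{1:n})."""
    n, y1, rest = len(ys), ys[0], ys[1:]
    num = big_k * f1(y1) * joint_given_count(rest, big_k - 1)
    den = num + (n - big_k) * f0(y1) * joint_given_count(rest, big_k)
    return num / den

# Hypothetical data with n = 6 and K = 2.
ys = [1.2, -0.3, 0.8, 0.1, 1.9, -1.0]
n, big_k = len(ys), 2
p_simple = posterior_simple(ys[0], n, big_k)
p_full = posterior_full(ys, big_k)

# Correction factor: ratio f_K / f_{K-1} and the weight multiplying (ratio - 1).
ratio = joint_given_count(ys[1:], big_k) / joint_given_count(ys[1:], big_k - 1)
weight = (n - big_k) * f0(ys[0]) / (big_k * f1(ys[0]) + (n - big_k) * f0(ys[0]))
p_from_identity = p_simple / (1.0 + weight * (ratio - 1.0))
print(p_simple, p_full, p_from_identity)
```

The closer the ratio $f_K/f_{K-1}(Y_{2:n})$ is to 1, the closer the two posteriors are, which is what the comparison of $g_K$ with $g_{K-1}$ quantifies.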
4. Dense $\mu$s

We consider now another simple case in which $\boldsymbol{\mu}$ can be ordered $\mu_{(1)}, \dots, \mu_{(n)}$
such that the difference $\mu_{(i+1)} - \mu_{(i)}$ is uniformly small. This will happen if, for
example, $\boldsymbol{\mu}$ is in fact a random sample from a distribution with a density with
respect to Lebesgue measure which is bounded away from 0 on its support,
or more generally, if it is sampled from a distribution with short tails. Denote by
$Y_{(1)}, \dots, Y_{(n)}$ and $f_{(1)}, \dots, f_{(n)}$ the corresponding ordering of the $Y$s and $f$s.
We assume in this section

(B1) For some constants $A_n$ and $V_n$ that converge slowly to infinity:

\[ \max_{i,j} |\mu_i - \mu_j| = A_n. \]

Note that condition (B1) holds for both the normal shift model and the normal
scale model, if $\boldsymbol{\mu}$ behaves like a sample from a distribution with a density as
above.

Theorem 4.1. If Assumption (B1) holds then

\[ \sum_{i=1}^n (\hat\mu_i^S - \hat\mu_i^{PI})^2 = O_p(A_n^2 V_n^2 / n). \]

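Proof. By definition

\[ \hat\mu_1^S = \frac{\sum_{i=1}^n \mu_i f_i(Y_1)}{\sum_{i=1}^n f_i(Y_1)}, \qquad
\hat\mu_1^{PI} = \frac{\sum_{i=1}^n \mu_i f_i(Y_1) f_{-i}(Y_{2:n})}{\sum_{i=1}^n f_i(Y_1) f_{-i}(Y_{2:n})}, \]

where $f_{-i}$ is the density of $Y_{2:n}$ under $\boldsymbol{\mu}_{-i}$:

\[ f_{-i}(y_{2:n}) = \frac{1}{(n-1)!} \sum_{\pi \in \mathcal{P}(2:n)} \prod_{j=2}^{i} f_{j-1}(y_{\pi(j)}) \prod_{j=i+1}^{n} f_j(y_{\pi(j)}). \]

The result will follow if we argue that $\max_i |f_{-i}(Y_{2:n})/f_{-1}(Y_{2:n}) - 1| \to 0$.
In fact we will establish a slightly stronger claim that

\[ \|f_{-i} - f_{-1}\|_{TV} \to 0, \]

where $\|\cdot\|_{TV}$ denotes the total variation norm.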

We will bound this distance by the distance between two other densities.
Let $\tilde g_{-1}(y_{2:n}) = \prod_{j=2}^n f_j(y_j)$, the true distribution of $Y_{2:n}$. We define now a
similar analog of $f_{-i}$. Let $r_j$ and $y_{(r)}$ be defined by $f_j = f_{(r_j)}$ and $y_{(r_j)} = y_j$,
$j = 1, \dots, n$. Suppose, for simplicity, that $r_i < r_1$. Let

\[ \tilde g_{-i}(y_{2:n}) = \tilde g_{-1}(y_{2:n}) \prod_{j=r_i}^{r_1 - 1} \frac{f_{(j+1)}}{f_{(j)}}(y_{(j)}). \]

Note that $\tilde g_{-i}$ depends only on $\boldsymbol{\mu}_{-i}$. Moreover, if $Y_{2:n} \sim \tilde g_{-j}$ then one can obtain
$Y_{2:n} \sim f_{-j}$ by the Markov kernel that takes $Y_{2:n}$ to a random permutation of
itself. It follows from Lemma 2.1 that

\[ \|f_{-i} - f_{-1}\|_{TV} \le \|\tilde g_{-i} - \tilde g_{-1}\|_{TV}
= E_{\tilde g_{-1}} \Big| \frac{\tilde g_{-i}}{\tilde g_{-1}}(Y_{2:n}) - 1 \Big|. \]

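But, by assumption,

\[ R_k = \prod_{j=k}^{r_1 - 1} \frac{f_{(j+1)}}{f_{(j)}}(Y_{(j)}) \]

is a reversed $L_2$ martingale, and it follows from Assumption (B1) that

\[ \max_{k < r_1} |R_k - 1| = O_p(A_n V_n / n). \]

A similar argument applies to $i$ with $r_i > r_1$, yielding

\[ \max_i \|f_{-i} - f_{-1}\|_{TV} = O_p(A_n V_n / n). \]

But then we argue

\[ |\hat\mu_1^S - \hat\mu_1^{PI}| \le \max_{i,j} |\mu_i - \mu_j| \Big( \max_{i,j} \frac{f_{-i}}{f_{-j}}(Y_{2:n}) - 1 \Big) = O_p(A_n V_n / n). \]

The theorem follows. □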



5. Main result 

We assume that for some $C < \infty$:

(G1) We have $\max_{i \in \{1, \dots, n\}} |\mu_i| < C$ and $\max_{i,j \in \{1, \dots, n\}} E_{\mu_i} (f_{\mu_j}(Y_1)/f_{\mu_i}(Y_1))^2 < C$.
Finally, there is $\gamma > 0$ such that $\min_{i,j \in \{1, \dots, n\}} P_{\mu_i}(f_{\mu_j}(Y_1)/f_{\mu_i}(Y_1) > \gamma) \ge 1/2$.



imsart-generic ver. 2007/02/20 file: definettiSubmitted.tex date: February 10, 2008 



Greenshtein and Ritov/simple estimators for compound decisions 






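(G2) Let

\[ p_j(Y_i) = \frac{f_j(Y_i)}{\sum_{k=1}^n f_k(Y_i)}, \qquad i, j = 1, \dots, n. \]

Then

\[ \sum_{i=1}^n \sum_{j=1}^n E\,(p_j(Y_i))^2 < C, \qquad
\sum_{i=1}^n E\, \frac{1}{n \min_j p_j(Y_i)} < Cn, \]

together with a further bound involving $\sum_{j=1}^n (p_j(Y_i))^2$ whose exact form is garbled in the source.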

Both assumptions describe a situation where the $\mu$s do not "separate". They
cannot be too far one from another, geometrically or statistically (Assumption
(G1)), and they are dense in the sense that each $Y$ can be explained by many
of the $\mu$s (Assumption (G2)). The conditions hold for the normal shift model
if $\boldsymbol{\mu}_n$ are uniformly bounded: Suppose the common variance is 1 and $|\mu_j| < A_n$.
Then

\[ E \sum_{j=1}^n (p_j(Y_1))^2
\le n\, E \Big( \frac{e^{-(Y_1^2 - 2A_n|Y_1|)/2}}{n\, e^{-(Y_1^2 + 2A_n|Y_1| + A_n^2)/2}} \Big)^2
= \frac{e^{A_n^2}}{n}\, E\, e^{4A_n|Y_1|}, \]

and since the same bound holds for every $Y_i$, the first part of (G2) holds. The other
parts follow from similar calculations.
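The pointwise bound behind this calculation is $p_j(y) \le e^{2A_n|y| + A_n^2/2}/n$ whenever all $|\mu_k| \le A_n$ and the variance is 1: the numerator density is at most $c\,e^{-y^2/2 + A_n|y|}$ and every density in the denominator is at least $c\,e^{-y^2/2 - A_n|y| - A_n^2/2}$. The following small check uses hypothetical means and evaluation points; `weights` and `weight_bound` are illustrative names.

```python
import math

def weights(y, mus):
    """p_j(y) = f_j(y) / sum_k f_k(y) for N(mu_j, 1) densities.
    Computed on the log scale for numerical stability."""
    logs = [-0.5 * (y - m) ** 2 for m in mus]
    top = max(logs)
    w = [math.exp(v - top) for v in logs]
    s = sum(w)
    return [wi / s for wi in w]

def weight_bound(y, a_n, n):
    """Upper bound exp(2 A_n |y| + A_n^2 / 2) / n on every p_j(y) when |mu_j| <= A_n."""
    return math.exp(2.0 * a_n * abs(y) + 0.5 * a_n ** 2) / n

# Hypothetical bounded means and evaluation points.
mus = [-1.5, -0.2, 0.0, 0.7, 1.5]
a_n = 1.5
for y in [-4.0, -1.0, 0.0, 0.3, 2.5, 5.0]:
    p = weights(y, mus)
    print(f"y={y:5.1f}  max p_j={max(p):.4f}  bound={weight_bound(y, a_n, len(mus)):.4f}")
```

The bound is crude for large $|y|$ (it can exceed 1), but it is enough for the moment calculation because $E\,e^{4A_n|Y_1|}$ is finite for a normal variable with bounded mean.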
Theorem 5.1. Assume that (G1) and (G2) hold. Then

(i) $E\|\Delta^S_{\boldsymbol{\mu}} - \Delta^{PI}_{\boldsymbol{\mu}}\|^2 = O(1)$,
(ii) $r^S_n - r^{PI}_n = O(1)$.

Corollary 5.2. Suppose $\mathcal{F} = \{N(\mu, 1) : |\mu| < c\}$ for some $c < \infty$; then the
conclusions of the theorem follow.

Proof. It was mentioned already in the introduction that when we are restricted
to permutation invariant procedures we can consider the Bayesian model under
which $(\pi, Y_{1:n})$, $\pi$ a random permutation, have a distribution given by (1.1). Fix
now $i \in \{1, \dots, n\}$. Under this model we want to compare

\[ \hat\mu_i^S = E(\mu_{\pi(i)} \mid Y_i), \qquad i = 1, \dots, n, \]

to

\[ \hat\mu_i^{PI} = E(\mu_{\pi(i)} \mid Y_{1:n}), \qquad i = 1, \dots, n. \]

More explicitly:

\[
\begin{aligned}
\hat\mu_i^S &= \sum_{j=1}^n \mu_j p_j(Y_i), \qquad i = 1, \dots, n, \\
\hat\mu_i^{PI} &= \frac{\sum_{j=1}^n \mu_j f_j(Y_i) f_{-j}(Y_{-i})}{\sum_{j=1}^n f_j(Y_i) f_{-j}(Y_{-i})}
= \sum_{j=1}^n \mu_j p_j(Y_i) W_j(Y_{-i}, Y_i), \qquad i = 1, \dots, n,
\end{aligned}
\qquad (5.1)
\]

where for all $i, j = 1, \dots, n$, $f_{-j}(Y_{-i})$ was defined in Section 2, and

\[ p_j(Y_i) = \frac{f_j(Y_i)}{\sum_{k=1}^n f_k(Y_i)}, \qquad
W_j(Y_{-i}, Y_i) = \frac{f_{-j}(Y_{-i})}{\sum_{k=1}^n p_k(Y_i) f_{-k}(Y_{-i})}. \]

Note that $\sum_{k=1}^n p_k(Y_i) = 1$, and $W_j(Y_{-i}, Y_i)$ is the likelihood ratio between
two (conditional on $Y_i$) densities of $Y_{-i}$, say $g_{j0}$ and $g_1$. Consider two other
densities (again, conditional on $Y_i$):



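\[
\begin{aligned}
\tilde g_{j0}(Y_{-i} \mid Y_i) &= f_i(Y_j) \prod_{m \ne i, j} f_m(Y_m), \\
\tilde g_{j1}(Y_{-i} \mid Y_i) &= \tilde g_{j0}(Y_{-i} \mid Y_i)
\Big( \sum_{k \ne i, j} p_k(Y_i) \frac{f_j}{f_k}(Y_k) + p_i(Y_i) \frac{f_j}{f_i}(Y_j) + p_j(Y_i) \Big).
\end{aligned}
\]

Note that $g_{j0} = \tilde g_{j0} \circ K$ and $g_1 = \tilde g_{j1} \circ K$, where $K$ is the Markov kernel that
takes $Y_{-i}$ to a random permutation of itself. It follows from Lemma 2.1 that

\[ E\big( |W_j(Y_{-i}, Y_i) - 1|^2 \mid Y_i \big)
\le E_{\tilde g_{j1}} \Big| \frac{\tilde g_{j0}}{\tilde g_{j1}} - 1 \Big|^2
= E_{\tilde g_{j0}} \Big( \frac{\tilde g_{j0}}{\tilde g_{j1}} - 2 + \frac{\tilde g_{j1}}{\tilde g_{j0}} \Big).
\qquad (5.2) \]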



This expectation does not depend on $j$: $i$ is related only to the $Y$s while $j$ is
related only to $\boldsymbol{\mu}$. Hence, to simplify notation, we take wlog $j = i$. Denote

\[ L = \frac{\tilde g_{i1}}{\tilde g_{i0}} = p_i(Y_i) + \sum_{k \ne i} p_k(Y_i) \frac{f_i}{f_k}(Y_k), \qquad
V = \frac{\gamma n}{4} \min_j p_j(Y_i), \]

where $\gamma$ is as in (G1). Then by (5.2)

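\[
\begin{aligned}
E\big( |W_i(Y_{-i}, Y_i) - 1|^2 \mid Y_i \big)
&\le E_{\tilde g_{i0}} \Big( \frac{1}{L} - 2 + L \Big) \\
&\le \frac{1}{V} E_{\tilde g_{i0}} (L - 1)^2\, 1(L > V) + E_{\tilde g_{i0}} \frac{1(L \le V)}{L} \\
&\le \frac{C}{V} \sum_{k=1}^n p_k^2(Y_i) + E_{\tilde g_{i0}} \frac{1(L \le V)}{L}
\end{aligned}
\qquad (5.3)
\]

by (G1). Bound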



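\[ L \ge \gamma \min_k p_k(Y_i) \sum_{k} 1\Big( \frac{f_i}{f_k}(Y_k) > \gamma \Big) \ge \gamma \min_k p_k(Y_i)\,(1 + U), \]

where $U \sim B(n-1, 1/2)$ (the 1 is for the $i$th summand). Hence

\[
\begin{aligned}
E_{\tilde g_{i0}} \frac{1(L \le V)}{L}
&\le \frac{1}{\gamma \min_k p_k(Y_i)} \sum_{k+1 \le n/4} \frac{1}{k+1} \binom{n-1}{k} 2^{-n+1} \\
&= \frac{1}{\gamma n \min_k p_k(Y_i)} \sum_{k+1 \le n/4} \binom{n}{k+1} 2^{-n+1} \\
&= \frac{O(e^{-n})}{\gamma n \min_k p_k(Y_i)}
\end{aligned}
\qquad (5.4)
\]

by large deviation.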
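The large deviation step can be illustrated numerically. The sketch below computes exactly the lower-tail probability $P(1 + U \le n/4)$ for $U \sim B(n-1, 1/2)$, compares it with Hoeffding's inequality (a standard bound swapped in here for illustration, not necessarily the one the authors had in mind), and checks the binomial identity $\frac{1}{k+1}\binom{n-1}{k} = \frac{1}{n}\binom{n}{k+1}$. The function names are hypothetical.

```python
import math
from math import comb

def lower_tail(n):
    """Exact P(1 + U <= n / 4) for U ~ Binomial(n - 1, 1/2)."""
    count = sum(comb(n - 1, k) for k in range(n) if k + 1 <= n / 4)
    return count / 2 ** (n - 1)

def hoeffding_bound(n):
    """Hoeffding bound exp(-2 t^2 / (n - 1)) for the event U <= E(U) - t,
    where t = (n - 1) / 2 - (n / 4 - 1) is the gap to the threshold n / 4 - 1."""
    t = (n - 1) / 2 - (n / 4 - 1)
    return math.exp(-2.0 * t * t / (n - 1))

for n in (8, 20, 40, 80):
    print(n, lower_tail(n), hoeffding_bound(n))
```

Both columns decay geometrically in $n$, which is why this term contributes negligibly even after multiplication by a polynomial factor in $n$.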

From (G1), (G2), (5.1), (5.3), and (5.4):

\[
E\Big( \Big( \sum_{j=1}^n \mu_j p_j(Y_i) \big( W_j(Y_{-i}, Y_i) - 1 \big) \Big)^2 \,\Big|\, Y_i \Big)
\le \max_j |\mu_j|^2 \sum_{j=1}^n p_j(Y_i)\, E\big( ( W_j(Y_{-i}, Y_i) - 1 )^2 \mid Y_i \big)
\le \dots
\]

for some $\kappa$ large enough. Claim (i) of the theorem follows. Claim (ii) follows from (i)
by Proposition 1.1.

□ 
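The decomposition of the permutation invariant estimator into the weights $p_j(Y_i)$ and the likelihood ratios $W_j(Y_{-i}, Y_i)$ can be verified by brute force for small $n$. The sketch below assumes the normal shift model and hypothetical data; `f_minus`, `decompose` and `posterior_mean_brute` are illustrative names, and $f_{-j}(Y_{-i})$ is computed by averaging over all $(n-1)!$ matchings of the remaining means to the remaining observations.

```python
import math
from itertools import permutations

def phi(y, mu):
    """N(mu, 1) density at y (assumed model)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def f_minus(j, ys_rest, mus):
    """f_{-j}(y_{-i}): density of the other n-1 observations when mu_j is removed
    and the remaining means are matched to them uniformly at random."""
    rest = [m for k, m in enumerate(mus) if k != j]
    total = sum(math.prod(phi(y, m) for y, m in zip(ys_rest, perm))
                for perm in permutations(rest))
    return total / math.factorial(len(rest))

def decompose(i, ys, mus):
    """Return (p, w) with p[j] = p_j(Y_i) and w[j] = W_j(Y_{-i}, Y_i)."""
    y_i, rest = ys[i], ys[:i] + ys[i + 1:]
    dens = [phi(y_i, m) for m in mus]
    p = [d / sum(dens) for d in dens]
    fm = [f_minus(j, rest, mus) for j in range(len(mus))]
    mix = sum(pk * fk for pk, fk in zip(p, fm))
    return p, [fk / mix for fk in fm]

def posterior_mean_brute(i, ys, mus):
    """E(mu_{pi(i)} | Y_{1:n}) by direct enumeration of all n! permutations."""
    num = den = 0.0
    for perm in permutations(range(len(mus))):
        like = math.prod(phi(y, mus[k]) for y, k in zip(ys, perm))
        num += mus[perm[i]] * like
        den += like
    return num / den

# Hypothetical means and observations.
mus = [-1.0, 0.0, 0.5, 2.0]
ys = [0.3, -0.8, 1.7, 0.9]
i = 2
p, w = decompose(i, ys, mus)
mu_simple = sum(m * pj for m, pj in zip(mus, p))
mu_pi = sum(m * pj * wj for m, pj, wj in zip(mus, p, w))
print(mu_simple, mu_pi, posterior_mean_brute(i, ys, mus))
```

The difference between the two estimators is $\sum_j \mu_j p_j(Y_i)(W_j - 1)$, so it is controlled entirely by how far the ratios $W_j$ are from 1.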